We start by loading a couple of packages for data manipulation, dimension reduction and fancy representations.
A single-nucleotide polymorphism (SNP) is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variant is present at a level of at least 0.5% in the population. SNPs are coded as 0, 1 or 2, meaning that 0, 1 or 2 alleles differ from the reference.
See the great Wikipedia page for details!
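To make the 0/1/2 coding concrete, here is a minimal sketch in base R. The genotype calls, the reference alleles and the helper `count_non_reference()` are all hypothetical, made up for illustration; they are not part of the data set used below.

```r
# Hypothetical toy data: genotypes of 4 individuals at 3 SNPs, written as
# pairs of alleles. The reference allele of each SNP is an assumption.
genotypes <- matrix(
  c("AA", "AG", "GG", "AA",
    "CC", "CC", "CT", "TT",
    "GG", "GT", "GT", "GG"),
  nrow = 4,
  dimnames = list(paste0("ind", 1:4), paste0("snp", 1:3))
)
reference <- c(snp1 = "A", snp2 = "C", snp3 = "G")

# For one SNP, count how many of the two alleles differ from the reference.
count_non_reference <- function(calls, ref) {
  alleles <- strsplit(calls, "")
  vapply(alleles, function(a) sum(a != ref), integer(1))
}

# Apply the count SNP by SNP to obtain the individuals x SNPs coded matrix.
coded <- sapply(colnames(genotypes), function(s) {
  count_non_reference(genotypes[, s], reference[[s]])
})
rownames(coded) <- rownames(genotypes)
coded
```

Each entry of `coded` is 0, 1 or 2, which is the kind of table the rest of the analysis works on.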
SNPs can be measured on individuals with high-throughput technologies such as SNP arrays. SNP chips for humans contain more than 1 million variables! Here we only analyse a subsample of a data set, containing the 5500 most variable SNPs for 728 individuals of various origins, with the following descriptors:
The data are imported as follows:
load("SNP.RData")
snp <- data$Geno %>% as_tibble() %>%
  add_column(origin = data$origin, .before = 1)
The first column is a categorical variable describing the origin of each individual; the acronyms are detailed above.
I do not scale the variables, since SNP values are supposed to live on the same scale (values in \(\{0, 1, 2\}\)).
## Warning in PCA(snp, quali.sup = 1, scale.unit = FALSE, graph = FALSE, ncp =
## 500): Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package
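The warning above says that missing values were replaced by the mean of their variable. As a rough illustration of what that means, here is a minimal base R sketch on a hypothetical toy matrix; the helper `impute_column_means()` is made up for this example and is not a FactoMineR function. The `imputePCA` function of the missMDA package, recommended by the warning, is a more refined alternative that is not shown here.

```r
# Hypothetical toy genotype matrix with a few missing calls.
set.seed(1)
x <- matrix(sample(0:2, 40, replace = TRUE), nrow = 8,
            dimnames = list(NULL, paste0("snp", 1:5)))
x[cbind(c(2, 5, 7), c(1, 3, 3))] <- NA

# Replace each NA by the mean of its column, computed on observed values.
impute_column_means <- function(m) {
  col_means <- colMeans(m, na.rm = TRUE)
  missing <- which(is.na(m), arr.ind = TRUE)
  m[missing] <- col_means[missing[, "col"]]
  m
}

x_imputed <- impute_column_means(x)
anyNA(x_imputed)
```

Mean imputation leaves the column means unchanged but shrinks the variance of the imputed columns, which is one reason a dedicated method may be preferable when many values are missing.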
The first axes are more informative than the others, but the information is generally well spread across the axes.
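The share of variance carried by each axis can be computed directly from the eigenvalues. The sketch below uses base R `prcomp()` on a hypothetical random matrix, as a stand-in for the unscaled PCA of the genotype table; it is not the author's code, and with FactoMineR the same quantities would be read from the `eig` component of the `PCA` result.

```r
set.seed(42)
# Hypothetical stand-in for the genotype matrix: 50 individuals, 20 SNPs.
geno <- matrix(sample(0:2, 50 * 20, replace = TRUE), nrow = 50)

# Centered but unscaled PCA, in the spirit of scale.unit = FALSE.
fit <- prcomp(geno, center = TRUE, scale. = FALSE)

# Share of total variance carried by each axis, and its running total.
variance_share <- fit$sdev^2 / sum(fit$sdev^2)
cumulative_share <- cumsum(variance_share)

# Percentages for the first five axes.
round(100 * head(variance_share, 5), 1)
round(100 * head(cumulative_share, 5), 1)
```

A slowly growing cumulative share is what "well spread" information looks like: no small set of axes captures most of the variance.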
The arguments habillage and col.ind have the same effect, but the first will be more useful later.
It is impressive how well the populations are separated!
This is just an example; you can do better than this, or do it differently!
Depending on how close the left-out group is to the rest of the cloud and to some particular existing groups, the fit is more or less altered.
## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "MKK"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package
## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "TSI"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package
## Warning in PCA(snp, quali.sup = 1, ind.sup = which(snp$origin == "GIH"), :
## Missing values are imputed by the mean of the variable: you should use the
## imputePCA function of the missMDA package
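The calls reported in the warnings above pass one population at a time as `ind.sup`, that is, as supplementary individuals that are projected on the axes without contributing to them. The idea can be sketched in base R as follows; the data, the group labels `G1` to `G3` and the use of `prcomp()` with `predict()` are assumptions made for illustration, not the FactoMineR code behind the warnings.

```r
set.seed(7)
# Hypothetical toy data: 60 individuals, 15 SNPs, three made-up groups.
geno <- matrix(sample(0:2, 60 * 15, replace = TRUE), nrow = 60)
origin <- rep(c("G1", "G2", "G3"), each = 20)

# Hold out one group: it plays the role of the supplementary individuals.
held_out <- origin == "G3"

# Fit the axes on the active individuals only, without scaling.
fit <- prcomp(geno[!held_out, ], center = TRUE, scale. = FALSE)

# Project the held-out individuals: center with the active means, then
# multiply by the loadings. They do not influence the axes themselves.
supp_coord <- predict(fit, newdata = geno[held_out, ])

dim(fit$x)       # coordinates of the active individuals
dim(supp_coord)  # coordinates of the supplementary individuals
```

Comparing the axes obtained with and without a given group is one way to see how much that group drives the fit.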
MNIST stands for Modified National Institute of Standards and Technology. It is a dataset of 60,000 small square 28×28 pixel grayscale images of handwritten single digits between 0 and 9. It is widely used for training and testing image processing systems and machine learning methods.
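Since each image is a 28×28 grid of grey levels, it can be stored as a vector of 784 pixel variables, which is the form a method such as PCA expects. Here is a minimal base R sketch; the image is filled with random values in 0..255 as a hypothetical stand-in, not an actual MNIST digit.

```r
# Hypothetical stand-in for one image: a 28 x 28 matrix of grey levels
# (random values here, not a real handwritten digit).
set.seed(3)
digit <- matrix(sample(0:255, 28 * 28, replace = TRUE), nrow = 28)

# Flatten the image into a single vector of 28 * 28 pixel variables.
pixels <- as.vector(digit)
length(pixels)

# Reshaping with the same dimensions gives the image back.
restored <- matrix(pixels, nrow = 28, ncol = 28)
identical(restored, digit)
```

Stacking such vectors row by row gives an individuals × variables table with one row per image.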